What can generic neural networks learn from a child's visual experience?
Young children develop sophisticated internal models of the world based on
their egocentric visual experience. How much of this is driven by innate
constraints and how much is driven by their experience? To investigate these
questions, we train state-of-the-art neural networks on a realistic proxy of a
child's visual experience without any explicit supervision or domain-specific
inductive biases. Specifically, we train both embedding models and generative
models on 200 hours of headcam video from a single child collected over two
years. We train a total of 72 different models, exploring a range of model
architectures and self-supervised learning algorithms, and comprehensively
evaluate their performance on downstream tasks. On average, the best embedding models
reach 70% of the performance of a high-performing ImageNet-trained reference model. They
also learn broad semantic categories without any labeled examples and learn to
localize semantic categories in an image without any location supervision.
However, these models are less object-centric and more background-sensitive
than comparable ImageNet-trained models. Generative models trained with the
same data successfully extrapolate simple properties of partially masked
objects, such as their texture, color, orientation, and rough outline, but
struggle with finer object details. We replicate our experiments with two other
children and find very similar results. Broadly useful high-level visual
representations are thus robustly learnable from a representative sample of a
child's visual experience without strong inductive biases.

Comment: 26 pages, 14 figures, 3 tables; code & all pretrained models available from https://github.com/eminorhan/silicon-menageri
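For a concrete sense of what such a pipeline involves, the sketch below shows one way an embedding model could be pretrained on unlabeled frames and then evaluated with a linear probe. This is a minimal illustration, not the authors' code: it assumes a SimCLR-style contrastive objective and a torchvision ResNet-50 backbone (the paper explores many architectures and self-supervised algorithms), and `headcam_frames`, `labeled_probe_set`, and `num_classes` are hypothetical placeholders.

```python
# Minimal sketch (not the authors' code): SimCLR-style self-supervised
# pretraining on unlabeled video frames, followed by a linear probe.
# `headcam_frames`, `labeled_probe_set`, and `num_classes` are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

def nt_xent_loss(z1, z2, temperature=0.1):
    """Contrastive (NT-Xent) loss over two augmented views of the same frames."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)            # (2N, D)
    sim = z @ z.t() / temperature                           # pairwise similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))                   # ignore self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)                    # positive pair = other view

# Backbone trained without any labels; small projection head for the contrastive loss.
backbone = resnet50(weights=None)
backbone.fc = nn.Identity()                                 # expose 2048-d features
projector = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 128))
optimizer = torch.optim.Adam(
    list(backbone.parameters()) + list(projector.parameters()), lr=1e-4
)

for view1, view2 in headcam_frames:                         # two augmentations per frame
    z1, z2 = projector(backbone(view1)), projector(backbone(view2))
    loss = nt_xent_loss(z1, z2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Downstream evaluation: freeze the backbone and fit a linear classifier
# ("linear probe") on a labeled dataset.
num_classes = 1000                                          # depends on the probe dataset
backbone.eval()
probe = nn.Linear(2048, num_classes)
probe_optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

for images, labels in labeled_probe_set:
    with torch.no_grad():
        features = backbone(images)                         # frozen features
    loss = F.cross_entropy(probe(features), labels)
    probe_optimizer.zero_grad()
    loss.backward()
    probe_optimizer.step()
```

The probe-based setup reflects the kind of downstream comparison described in the abstract: the pretrained encoder is kept fixed, so its accuracy relative to an ImageNet-trained encoder measures the quality of the learned representations rather than of any task-specific fine-tuning.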